Gender and language-variety Identification with MicroTC

نویسندگان

  • Eric Sadit Tellez
  • Sabino Miranda-Jiménez
  • Mario Graff
  • Daniela Moctezuma
چکیده

In this notebook, we describe our approach to cope with the Author Profiling task on PAN17 which consists of both gender and language identification for Twitter’s users. We used our MicroTC (μTC) framework as the primary tool to create our classifiers. μTC follows a simple approach to text classification; it converts the problem of text classification to a model selection problem using several simple text transformations, a combination of tokenizers, a term-weighting scheme, and finally, it classifies using a Support Vector Machine. Our approach reaches accuracies of 0.7838, 0.8054, 0.7957, and 0.8538, for gender identification; and for language variety, it achieves 0.8275, 0.9004, 0.9554, and 0.9850. All these, for Arabic, English, Spanish, and Portuguese languages, respectively.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Including Dialects and Language Varieties in Author Profiling

This paper presents a computational approach to author profiling taking gender and language variety into account. We apply an ensemble system with the output of multiple linear SVM classifiers trained on character and word ngrams. We evaluate the system using the dataset provided by the organizers of the 2017 PAN lab on author profiling. Our approach achieved 75% average accuracy on gender iden...

متن کامل

PAN 2017: Author Profiling - Gender and Language Variety Prediction

We present the results of gender and language variety identification performed on the tweet corpus prepared for the PAN 2017 Author profiling shared task. Our approach consists of tweet preprocessing, feature construction, feature weighting and classification model construction. We propose a Logistic regression classifier, where the main features are different types of character and word n-gram...

متن کامل

Author Profiling at PAN: from Age and Gender Identification to Language Variety Identification (invited talk)

Author profiling is the study of how language is shared by people, a problem of growing importance in applications dealing with security, in order to understand who could be behind an anonymous threat message, and marketing, where companies may be interested in knowing the demographics of people that in online reviews liked or disliked their products. In this talk we will give an overview of th...

متن کامل

Using Character n-grams and Style Features for Gender and Language Variety Classification

Author profiling is the problem of determining the characteristics of an author of an anonymous text. In this paper, we detail a method to determine the language variety and the gender of the authors of tweets, as a submission for the Author Profiling Task at PAN 2017. This method seeks to select the most significant character n-grams for each class considered, combining them with style feature...

متن کامل

Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter

This overview presents the framework and the results of the Author Profiling task at PAN 2017. The objective of this year is to address gender and language variety identification. For this purpose a corpus from Twitter has been provided for four different languages: Arabic, English, Portuguese, and Spanish. Altogether, the approaches of 22 participants are evaluated.

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017